Setup

Setup:

Ensure you have completed the necessary setup instructions in Setup specifics below in order to follow along during this lecture. We will be using R for the following analyses and you will need to download certain data files coefHorvath.rds, methylation.rds, prostate.rds if you want to run this code too.

Setup specifics
# Install BiocManager
install.packages("BiocManager", quiet = TRUE)

# Download and read dependencies file
download.file(
  "https://raw.githubusercontent.com/EleanorSC/High-Dimensional-Statistics/main/dependencies.csv",
  destfile = 'dependencies.csv'
)
table <- read.table('dependencies.csv')

# Install dependencies using BiocManager
BiocManager::install(table[[1]])

# Create a directory for data files
dir.create("data", showWarnings = FALSE)

# List of data files to download
data_files <- c(
  "coefHorvath.rds",
  "methylation.rds",
  "prostate.rds"
  # "scRNAsea_data.rds" NOTE: if extension exercises are reached, load this dataset
)

# Download data files into the "data" directory
for (file in data_files) {
  download.file(
    url = file.path(
      "https://raw.githubusercontent.com/EleanorSC/High-Dimensional-Statistics/main/Data",
      file
    ),
    destfile = file.path("data", file)
  )
}

Part 1: Intro to High Dimensional Data

Challenge 1

Descriptions of four research questions and their datasets are given below. Which of these scenarios use high-dimensional data?

  1. Predicting patient blood pressure using: cholesterol level in blood, age, and BMI measurements, collected from 100 patients.
  2. Predicting patient blood pressure using: cholesterol level in blood, age, and BMI, as well as information on 200,000 single nucleotide polymorphisms from 100 patients.
  3. Predicting the length of time patients spend in hospital with pneumonia infection using: measurements on age, BMI, length of time with symptoms, number of symptoms, and percentage of neutrophils in blood, using data from 200 patients.
  4. Predicting probability of a patient’s cancer progressing using gene expression data from 20,000 genes, as well as data associated with general patient health (age, weight, BMI, blood pressure) and cancer growth (tumour size, localised spread, blood test results).

Challenge 1 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers
  1. No. The number of features is relatively small (4 including the response variable since this is an observed variable).

  2. Yes, this is an example of high-dimensional data. There are 200,004 features.

  3. No. The number of features is relatively small (6).

  4. Yes. There are 20,008 features.

Challenge 2

Load the dataset using the here package. Ensure the prostate.rds file is located in the data folder of your project directory.

prostate <- readRDS(here("data/prostate.rds"))

Examine the dataset (in which each row represents a single patient) to:

  1. Determine how many observations (n) and features (p) are available (hint: see the dim() function).

  2. Examine what variables were measured (hint: see the names() and head() functions).

  3. Plot the relationship between the variables (hint: see the pairs() function). What problem(s) with high-dimensional data analysis does this illustrate?

Challenge 2 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers
  1. Using dim() to find dimensions of dataset.
# ----------------------------------------------------------------------
# Examine the Dataset
# ----------------------------------------------------------------------

# The dataset represents clinical data where each row corresponds to a single patient.
# Let's first examine the dimensions of the dataset to understand its structure.

# Determine the number of observations (n) and features (p)
dimensions <- dim(prostate)  # Returns a vector: [number of rows, number of columns]
print(dimensions)  # Print the dimensions
## [1] 97  9
  1. Examining names of variables
# Examine the variables measured in the dataset
variable_names <- names(prostate)  # Returns the column names (variables)
print(variable_names)
## [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason"
## [8] "pgg45"   "lpsa"
# View the first few rows of the dataset to get a sense of its contents
head(prostate)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
  1. Using the pairs() function. Plotting each pair of variables against each other:
# ----------------------------------------------------------------------
# Visualize Relationships Between Variables
# ----------------------------------------------------------------------

# Plot pairwise relationships between variables using the `pairs()` function.
# This function creates a scatterplot matrix to visualize the relationships
# between all pairs of numeric variables.

pairs(prostate)

Challenge 3

1). Use the cor() function to examine correlations between all variables in the prostate dataset. Are some pairs of variables highly correlated using a threshold of 0.75 for the correlation coefficients?

2). Use the lm() function to fit univariate regression models to predict patient age using two variables that are highly correlated as predictors. Which of these variables are statistically significant predictors of age? Hint: the summary() function can help here.

3). Fit a multiple linear regression model predicting patient age using both variables. What happened?

Challenge 3 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1). Using the cor() function to examine correlations between all variables

# ----------------------------------------------------------------------
# 1. Examine Correlations Between Variables
# ----------------------------------------------------------------------
# Use the cor() function to compute pairwise correlations between all variables 
# in the prostate dataset. Identify pairs of variables with a correlation
# coefficient greater than 0.75 (absolute value).

# Compute the correlation matrix for the dataset
cor(prostate)
##            lcavol      lweight       age         lbph         svi          lcp
## lcavol  1.0000000  0.194128286 0.2249999  0.027349703  0.53884500  0.675310484
## lweight 0.1941283  1.000000000 0.3075286  0.434934636  0.10877851  0.100237795
## age     0.2249999  0.307528614 1.0000000  0.350185896  0.11765804  0.127667752
## lbph    0.0273497  0.434934636 0.3501859  1.000000000 -0.08584324 -0.006999431
## svi     0.5388450  0.108778505 0.1176580 -0.085843238  1.00000000  0.673111185
## lcp     0.6753105  0.100237795 0.1276678 -0.006999431  0.67311118  1.000000000
## gleason 0.4324171 -0.001275658 0.2688916  0.077820447  0.32041222  0.514830063
## pgg45   0.4336522  0.050846821 0.2761124  0.078460018  0.45764762  0.631528245
## lpsa    0.7344603  0.354120390 0.1695928  0.179809410  0.56621822  0.548813169
##              gleason      pgg45      lpsa
## lcavol   0.432417056 0.43365225 0.7344603
## lweight -0.001275658 0.05084682 0.3541204
## age      0.268891599 0.27611245 0.1695928
## lbph     0.077820447 0.07846002 0.1798094
## svi      0.320412221 0.45764762 0.5662182
## lcp      0.514830063 0.63152825 0.5488132
## gleason  1.000000000 0.75190451 0.3689868
## pgg45    0.751904512 1.00000000 0.4223159
## lpsa     0.368986803 0.42231586 1.0000000
# rounding helps to visualise the correlations
round(cor(prostate), 2) 
##         lcavol lweight  age  lbph   svi   lcp gleason pgg45 lpsa
## lcavol    1.00    0.19 0.22  0.03  0.54  0.68    0.43  0.43 0.73
## lweight   0.19    1.00 0.31  0.43  0.11  0.10    0.00  0.05 0.35
## age       0.22    0.31 1.00  0.35  0.12  0.13    0.27  0.28 0.17
## lbph      0.03    0.43 0.35  1.00 -0.09 -0.01    0.08  0.08 0.18
## svi       0.54    0.11 0.12 -0.09  1.00  0.67    0.32  0.46 0.57
## lcp       0.68    0.10 0.13 -0.01  0.67  1.00    0.51  0.63 0.55
## gleason   0.43    0.00 0.27  0.08  0.32  0.51    1.00  0.75 0.37
## pgg45     0.43    0.05 0.28  0.08  0.46  0.63    0.75  1.00 0.42
## lpsa      0.73    0.35 0.17  0.18  0.57  0.55    0.37  0.42 1.00
# As seen above, some variables are highly correlated.
# In particular, the correlation between gleason and pgg45 is equal to 0.75

cor_matrix <- cor(prostate, use = "complete.obs")
high_correlations <- which(abs(cor_matrix) > 0.75 & abs(cor_matrix) < 1, arr.ind = TRUE)

# Extract and display the highly correlated variable pairs
high_cor_pairs <- tibble::tibble(
  Var1 = rownames(cor_matrix)[high_correlations[, 1]],
  Var2 = colnames(cor_matrix)[high_correlations[, 2]],
  Correlation = cor_matrix[high_correlations]
)
print(high_cor_pairs)
## # A tibble: 2 × 3
##   Var1    Var2    Correlation
##   <chr>   <chr>         <dbl>
## 1 pgg45   gleason       0.752
## 2 gleason pgg45         0.752

2). Using the lm() function to predict patient age using two variables that are highly correlated as predictors. Which of these variables are statistically significant predictors of age? (A = gleason + pgg45)

# ----------------------------------------------------------------------
# 2. Fit Univariate Regression Models
# ----------------------------------------------------------------------
# Fitting univariate regression models to predict age using `gleason` and `pgg45` as predictors.
# Use lm() to fit univariate regression models predicting patient age 
# using two highly correlated variables. Evaluate statistical significance
# of each predictor using summary().

model_gleason <- lm(age ~ gleason, data = prostate)
model_pgg45 <- lm(age ~ pgg45, data = prostate)

# Summarize results for each model
summary(model_gleason)  # Check significance of gleason
## 
## Call:
## lm(formula = age ~ gleason, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.780  -3.552   1.448   4.220  13.448 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.146      6.918   6.525 3.29e-09 ***
## gleason        2.772      1.019   2.721  0.00774 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.209 on 95 degrees of freedom
## Multiple R-squared:  0.0723, Adjusted R-squared:  0.06254 
## F-statistic: 7.404 on 1 and 95 DF,  p-value: 0.007741
summary(model_pgg45)  # Check significance of pgg45
## 
## Call:
## lm(formula = age ~ pgg45, data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.0889  -3.4533   0.9111   4.4534  15.1822 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 62.08890    0.96758   64.17  < 2e-16 ***
## pgg45        0.07289    0.02603    2.80  0.00619 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.193 on 95 degrees of freedom
## Multiple R-squared:  0.07624,    Adjusted R-squared:  0.06651 
## F-statistic:  7.84 on 1 and 95 DF,  p-value: 0.006189

3). Fitting a multiple linear regression model predicting patient age using both variables

# ----------------------------------------------------------------------
# 3. Fit a Multiple Regression Model
# ----------------------------------------------------------------------
# Fit a multiple regression model predicting patient age using both variables.
# Examine what happens when both correlated variables are included.

# Fit a multiple regression model
model_multivar <- lm(age ~ gleason + pgg45, data = prostate)

# Summarize the multiple regression model
summary(model_multivar)
## 
## Call:
## lm(formula = age ~ gleason + pgg45, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.927  -3.677   1.323   4.323  14.420 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.95548    9.74316   5.435  4.3e-07 ***
## gleason      1.45363    1.54299   0.942    0.349    
## pgg45        0.04490    0.03951   1.137    0.259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.198 on 94 degrees of freedom
## Multiple R-squared:  0.08488,    Adjusted R-squared:  0.06541 
## F-statistic: 4.359 on 2 and 94 DF,  p-value: 0.01547

Although gleason and pgg45 have statistically significant univariate effects, this is no longer the case when both variables are simultaneously included as covariates in a multivariate regression model.

Including highly correlated variables such as gleason and pgg45 simultaneously in the same regression model can lead to problems in fitting a regression model and interpreting its output.

Although each variable appears to be associated with the response individually, the model cannot distinguish the contribution of each variable to the model. This can also increase the risk of over-fitting since the model may fit redundant variables to noise rather than true relationships.

To allow the information from variables to be included in the same model despite high levels of correlation, we can use dimensionality reduction methods to collapse multiple variables into a single new variable. We can also use modifications to linear regression like regularisation, which we will cover later.

Using Bioconductor

This lecture focuses on statistical methods for visualising and analysing high-dimensional biological data using Bioconductor packages.

Bioconductor is an open-source platform designed for analysing high-throughput genomic data. It provides a variety of useful packages and example datasets.More details and resources are available at: https://www.bioconductor.org/

Bioconductor packages can be installed and managed using the BiocManager package.

# Let's load the "minfi" package, a Bioconductor package specifically designed
# for analyzing Illumina Infinium DNA methylation arrays.

library("minfi")
browseVignettes("minfi")

Next, we load the methylation dataset which represents data collected using Illumina Infinium methylation arrays which are used to examine methylation across the human genome. These data include information collected from the assay as well as associated metadata from individuals from whom samples were taken.

library("here")
library("ComplexHeatmap")

methylation <- readRDS(here("data/methylation.rds"))
head(colData(methylation))
methyl_mat <- t(assay(methylation))
## calculate correlations between cells in matrix
cor_mat <- cor(methyl_mat)

cor_mat[1:10, 1:10] # print the top-left corner of the correlation matrix
##             cg00075967  cg00374717  cg00864867  cg00945507  cg01027739
## cg00075967  1.00000000 -0.54187597 -0.23545127 -0.03328856  0.15814866
## cg00374717 -0.54187597  1.00000000  0.20194482 -0.01824105 -0.11722829
## cg00864867 -0.23545127  0.20194482  1.00000000 -0.07989013 -0.25388089
## cg00945507 -0.03328856 -0.01824105 -0.07989013  1.00000000  0.32991673
## cg01027739  0.15814866 -0.11722829 -0.25388089  0.32991673  1.00000000
## cg01353448  0.55187311 -0.37130021  0.02410416  0.07201148  0.08380460
## cg01584473 -0.52114045  0.30207176  0.49376101 -0.06213370 -0.04972067
## cg01644850 -0.08130044  0.08972931  0.07465884  0.21406721  0.35845482
## cg01656216 -0.27776967 -0.33015779  0.20665253 -0.14690180 -0.49462321
## cg01873645  0.18174953 -0.10881732 -0.02333667  0.17384071  0.18106477
##              cg01353448  cg01584473   cg01644850   cg01656216  cg01873645
## cg00075967  0.551873113 -0.52114045 -0.081300436 -0.277769670  0.18174953
## cg00374717 -0.371300212  0.30207176  0.089729312 -0.330157788 -0.10881732
## cg00864867  0.024104157  0.49376101  0.074658838  0.206652529 -0.02333667
## cg00945507  0.072011477 -0.06213370  0.214067207 -0.146901796  0.17384071
## cg01027739  0.083804601 -0.04972067  0.358454820 -0.494623205  0.18106477
## cg01353448  1.000000000 -0.24710600 -0.001904264 -0.003425993  0.07776795
## cg01584473 -0.247106001  1.00000000  0.013019805  0.263537096 -0.01537876
## cg01644850 -0.001904264  0.01301980  1.000000000 -0.331328760  0.19820041
## cg01656216 -0.003425993  0.26353710 -0.331328760  1.000000000 -0.24664137
## cg01873645  0.077767947 -0.01537876  0.198200410 -0.246641372  1.00000000

The assay() function creates a matrix-like object where rows represent probes for genes and columns represent samples. We calculate correlations between features in the methylation dataset and examine the first 100 cells of this matrix. The size of the dataset makes it difficult to examine in full, a common challenge in analysing high-dimensional genomics data.

Part 2: Regression with many outcomes using DNAm data

Working with DNAm data

For the following few exercises, we will be working with human DNA methylation (DNAm) data from flow-sorted blood samples, described in data. DNA methylation assays measure, for each of many sites in the genome, the proportion of DNA that carries a methyl mark (a chemical modification that does not alter the DNA sequence). In this case, the methylation data come in the form of a matrix of normalised methylation levels (M-values), where negative values correspond to unmethylated DNA and positive values correspond to methylated DNA. Along with this, we have a number of sample phenotypes (eg, age in years, BMI).

From our methylation.rds data we can extract:

the dimensions of the dataset using dim(). Importantly, in these objects and data structures for computational biology in R generally, observations are stored as columns and features (in this case, sites in the genome) are stored as rows.

This is in contrast to usual tabular data, where features or variables are stored as columns and observations are stored as rows; assays, (normalised methylation levels here), using assay(); sample-level information using colData().

library("here")
library("minfi")
methylation <- readRDS(here("data/methylation.rds"))

# Check the dimensions of the dataset using dim().

dim(methylation)
## [1] 5000   37
# The output shows that the object has dimensions of 5000 × 37,
# indicating 5000 features and 37 observations.
# To extract the matrix of methylation M-values, use the assay() function.

methyl_mat <- assay(methylation)

hist(methyl_mat, xlab = "M-value")

head(colData(methylation))
## DataFrame with 6 rows and 14 columns
##                     Sample_Well Sample_Name    purity         Sex       Age
##                     <character> <character> <integer> <character> <integer>
## 201868500150_R01C01         A07     PCA0612        94           M        39
## 201868500150_R03C01         C07   NKpan2510        95           M        49
## 201868500150_R05C01         E07      WB1148        95           M        20
## 201868500150_R07C01         G07       B0044        97           M        49
## 201868500150_R08C01         H07   NKpan1869        95           F        33
## 201868590193_R02C01         B03   NKpan1850        93           F        21
##                     weight_kg  height_m       bmi    bmi_clas Ethnicity_wide
##                     <numeric> <numeric> <numeric> <character>    <character>
## 201868500150_R01C01   88.4505    1.8542   25.7269  Overweight          Mixed
## 201868500150_R03C01   81.1930    1.6764   28.8911  Overweight  Indo-European
## 201868500150_R05C01   80.2858    1.7526   26.1381  Overweight  Indo-European
## 201868500150_R07C01   82.5538    1.7272   27.6727  Overweight  Indo-European
## 201868500150_R08C01   87.5433    1.7272   29.3452  Overweight  Indo-European
## 201868590193_R02C01   87.5433    1.6764   31.1507       Obese          Mixed
##                        Ethnic_self      smoker       Array       Slide
##                        <character> <character> <character>   <numeric>
## 201868500150_R01C01       Hispanic          No      R01C01 2.01869e+11
## 201868500150_R03C01      Caucasian          No      R03C01 2.01869e+11
## 201868500150_R05C01        Persian          No      R05C01 2.01869e+11
## 201868500150_R07C01      Caucasian          No      R07C01 2.01869e+11
## 201868500150_R08C01      Caucasian          No      R08C01 2.01869e+11
## 201868590193_R02C01 Finnish/Creole          No      R02C01 2.01869e+11
# Association between age and DNA methylation:
# The following heatmap summarises age and methylation levels available in the methylation dataset:

library("ComplexHeatmap")

age <- methylation$Age

# sort methylation values by age 
order <- order(age)
age_ord <- age[order]
methyl_mat_ord <- methyl_mat[, order]

# plot heatmap
Heatmap(methyl_mat_ord,
        cluster_columns = FALSE,
        show_row_names = FALSE,
        show_column_names = FALSE,
        name = "M-value",
        row_title = "Feature", 
        column_title =  "Sample", 
        top_annotation = columnAnnotation(age = age_ord))

Challenge 1

  1. Why can’t we simply fit many linear regression models for every combination of features (colData and assays) and draw conclusions based on significant p-values?

Challenge 1 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

There are several problems with this approach:

1. Multiple Testing Problem: - If we perform 5000 tests for each of 14 variables, even with no true associations, random noise would produce some significant (spurious) results.

2. Small Sample Size - Some covariates may have very small sample sizes (e.g., certain ethnicities), leading to unreliable or spurious findings.

3. Lack of Research Focus - Without a clear research question, interpreting each model becomes ambiguous. - Rationalising findings after the fact can lead to creating unsupported “stories” based solely on significant results.

Fitting a linear model to the DNAm data

Modeling Methylation with Age

In the data we have read in, we have a matrix of methylation values \(X\) and a vector of ages, \(y\). One way to model this is to see if we can use age to predict the expected (average) methylation value for sample \(j\) at a given locus \(i\), which can be written as:

\[ E(X_{ij}) = \beta_0 + \beta_1 \text{Age}_j \]

where \(\text{Age}_j\) is the age of sample \(j\). In this model, \(\beta_1\) represents the unit change in mean methylation level for each unit (year) change in age.

For a specific CpG, we can fit this model and extract more information from the model object. For illustration purposes, we arbitrarily select the first CpG in the methyl_mat matrix (the one on its first row).

# Arbitrarily select the first CpG in the methyl_mat matrix (the one on its first row):

age <- methylation$Age

# methyl_mat[1, ] indicates that the 1st CpG will be used as outcome variable
lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age)
lm_age_methyl1
## 
## Call:
## lm(formula = methyl_mat[1, ] ~ age)
## 
## Coefficients:
## (Intercept)          age  
##    0.902334     0.008911

We now have estimates for the expected methylation level when age equals 0 (the intercept) and the change in methylation level for a unit change in age (the slope). We could plot this linear model:

plot(age, methyl_mat[1, ], xlab = "Age", ylab = "Methylation level", pch = 16)
abline(lm_age_methyl1)

#A scatter plot of age versus the methylation level for an arbitrarily selected CpG side (the one stored as the first column of methyl_mat). Each dot represents an individual. The black line represents the estimated linear model.

For this feature, we can see that there is no strong relationship between methylation and age. We could try to repeat this for every feature in our dataset; however, we have a lot of features! We need an approach that allows us to assess associations between all of these features and our outcome while addressing the three considerations we outlined previously. Before we introduce this approach, let’s go into detail about how we generally check whether the results of a linear model are statistically significant.

Hypothesis Testing in Linear Models

Using the linear model above, we can test whether age affects methylation values for a given CpG via hypothesis testing. Specifically, we test the null hypothesis that \(\beta_1\), the regression coefficient for age, equals zero. If \(\beta_1 = 0\), there is no linear relationship between age and methylation level at the CpG.

The linear model output provides results associated with this null hypothesis by default. It compares observed results (estimated coefficients) to what we might expect under the null hypothesis if the experiment were repeated many times.

# For this linear model, we can use tidy() from the broom package to extract detailed
# information about the coefficients and the associated hypothesis tests in this model:

library("broom")
tidy(lm_age_methyl1)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  0.902      0.344      2.62   0.0129
## 2 age          0.00891    0.0100     0.888  0.381

Understanding Linear Model Output

  • Standard errors (std.error): Represent uncertainty in regression coefficient estimates (effect sizes).
  • Test statistics and p-values: Measure how unlikely observed results are under the null hypothesis.

For coefficient \(k\) (e.g., slope), the t-statistic is calculated as:

\[ t_k = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \]

Here, \(SE(\hat{\beta}_k)\) quantifies uncertainty in the effect size estimate. Knowing the null distribution of \(t_k\) helps us assess the likelihood of observing these results if the null hypothesis were true.

Challenge 2

  1. In the model we fitted, the estimate for the intercept is 0.902 and its associated p-value is 0.0129. What does this mean?

Challenge 2 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

The intercept estimate of 0.902 represents the mean methylation value for the first CpG when age is zero. However, this is not meaningful since there are no observations with age below 20 in the dataset. The p-value tests whether the intercept (\(\beta_0\)) is zero, but this is irrelevant to the analysis as we are more interested in the regression coefficient for age (\(\beta_1\)), which indicates whether there is a linear relationship between age and methylation.

Sharing information across outcome variables using limma

Sharing Information Across Features

In structured data, such as DNA methylation assays, we can leverage shared characteristics (e.g., features measured using the same technique) to pool information across features. This reduces noise and provides more robust estimates of uncertainty compared to analyzing each feature in isolation.

In a linear model, the t-statistic for a coefficient (\(\beta_k\)) is:

\[ t_k = \frac{\hat{\beta_k}}{SE(\hat{\beta_k})} \]

Large effect sizes with small standard errors lead to small p-values. However, random noise in small datasets can cause artificially low standard errors, leading to false positives.

The limma package addresses this by pooling information across features and moderating standard errors, shrinking them towards a common value. This approach produces moderated t-statistics and reduces the risk of false positives. Originally developed for gene expression microarrays, limma is widely used in RNAseq and other genomics data.

In limma, the model setup involves creating a design matrix, which defines the coefficients to fit in each linear model. This allows consistent modeling across multiple features.

# What is a model matrix?

design_age <- model.matrix(lm_age_methyl1) # model matrix
head(design_age)
##                     (Intercept) age
## 201868500150_R01C01           1  39
## 201868500150_R03C01           1  49
## 201868500150_R05C01           1  20
## 201868500150_R07C01           1  49
## 201868500150_R08C01           1  33
## 201868590193_R02C01           1  21
dim(design_age)
## [1] 37  2

The model matrix has the same number of rows as the methylation data (samples) and two columns: one for the intercept and one for age. While lm() handles this automatically, in limma, we explicitly define it.

Using lmFit(), the methylation data is modeled efficiently for each row. Applying eBayes() to the output pools standard errors, producing moderated t-statistics and p-values. This approach ensures robust statistical estimates. For more complex designs, refer to the limma user manual.

library("limma")
design_age <- model.matrix(lm_age_methyl1) # model matrix
fit_age <- lmFit(methyl_mat, design = design_age)
fit_age <- eBayes(fit_age)

To extract linear model results, use topTable(). By default, it returns results for the first coefficient (intercept). Since we are interested in the age coefficient, set coef = 2. To display all results, specify number = nrow(fit_age).

# Use the topTable() function to obtain the results of the linear models.
toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age))
head(toptab_age)
##                  logFC    AveExpr         t      P.Value   adj.P.Val        B
## cg08446924 -0.02571353 -0.4185868 -6.039068 5.595675e-07 0.002797837 5.131574
## cg06493994  0.01550941 -2.1057877  5.593988 2.239813e-06 0.005599533 3.747986
## cg17661642  0.02266668 -2.0527722  5.358739 4.658336e-06 0.006048733 3.019698
## cg05168977  0.02276336 -2.2918472  5.346500 4.838987e-06 0.006048733 2.981904
## cg24549277  0.01975577 -1.7466088  4.939242 1.708355e-05 0.011508818 1.731821
## cg04436528 -0.01943612  0.7033503 -4.917179 1.828563e-05 0.011508818 1.664608
# The output of topTable includes several columns:
# - logFC: The coefficient estimate (log fold change), representing effect size.
# - aveExpr: The average expression level.
# - t: The t-statistic for the coefficient.
# - P.Value: The p-value for the test.
# - adj.P.Val: The adjusted p-value (we'll discuss adjusted p-values shortly).
# - B: The log-odds that a feature is significantly different (a transformation of the p-value, not covered here).
# The term logFC is used for historical reasons from microarray experiments.

These results provide effect sizes and p-values for the association between methylation levels at each locus and age across the 37 samples.

To visualise effect sizes (coefficients) and statistical significance (p-values), we can plot effect sizes against p-values for all linear models. Such plots are called “volcano plots” because of their shape.

plot(toptab_age$logFC, -log10(toptab_age$P.Value),
     xlab = "Effect size", ylab = bquote(-log[10](p-value)),
     pch = 19
)

# In this figure:
# - Each point represents a feature of interest.
# - The x-axis shows the effect size from a linear model.
# - The y-axis shows −log10(p-value), where higher values indicate stronger statistical evidence
#   of a non-zero effect size.

# - Positive effect sizes indicate increasing methylation with age.
# - Negative effect sizes indicate decreasing methylation with age.
# - Points higher on the y-axis represent features with results unlikely under the null hypothesis.

# The goal is to identify features with different methylation levels across age groups.
# Ideally, there would be a clear separation between “null” (no effect) and “non-null” (effect exists) features.
# However, in practice, we often see a continuum of effect sizes and p-values without clear separation.

# Statistical methods can provide insights from these continuous measures,
# but it is often useful to generate a list of features with confident non-zero effect sizes.
# This is challenging due to the large number of tests performed.

Challenge 3

  1. Try fitting a linear model using smoking status as a covariate instead of age, and create a volcano plot. Note: Smoking status is stored as methylation$smoker.

  2. In the lecture example, we saw that information sharing can result in larger p-values. Why might this be preferable?

Challenge 3 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1. Fitting linear model using smoking status as a covariate instead of age:

# ----------------------------------------------------------------------
# CHALLENGE 3
# ----------------------------------------------------------------------

# 1. 
design_smoke <- model.matrix(~methylation$smoker)
fit_smoke <- lmFit(methyl_mat, design = design_smoke)
fit_smoke <- eBayes(fit_smoke)
toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke))
plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value),
     xlab = "Effect size", ylab = bquote(-log[10](p)),
     pch = 19
)

2. Information sharing can result in larger p-values. Why might this be preferable?:

Being more conservative when identifying features can help reduce false discoveries. It is also important to be cautious when rejecting the null hypothesis based on small standard errors caused by abnormally low variability for certain features.

The problem with multiple tests

With a large number of features, it is important to determine which features are “interesting” or “significant” for further study.

Using a standard significance threshold of 0.05 may lead to many false positives. A p-value threshold of 0.05 means there is a 1 in 20 chance of observing results as extreme or more extreme under the null hypothesis (no association between age and methylation).

If we perform many more than 20 tests, we are likely to observe significant p-values purely due to random chance, even when the null hypothesis is true.

To illustrate this, we can permute (scramble) the age values and rerun the test to see how random chance affects the results.

# ----------------------------------------------------------------------

# Permute (scramble) the age values and rerun the test to see how random chance affects the results.

set.seed(123) 
age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))]
design_age_perm <- model.matrix(~age_perm)

fit_age_perm <- lmFit(methyl_mat, design = design_age_perm)
fit_age_perm <- eBayes(fit_age_perm)
toptab_age_perm <- topTable(fit_age_perm, coef = 2, number = nrow(fit_age_perm))

plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value),
     xlab = "Effect size", ylab = bquote(-log[10](p)),
     pch = 19
)
abline(h = -log10(0.05), lty = "dashed", col = "red")

A random sequence of ages was generated, so no true association between methylation levels and this random sequence is expected. Despite this, many features still show p-values below the traditional significance threshold of p = 0.05. In this example, 226 features are significant at p < 0.05, which demonstrates how using this threshold in a real experiment could falsely identify features as associated with age purely by chance.

When performing multiple tests, features are classified as either “significant” or “non-significant,” but some classifications will inevitably be incorrect. Results fall into four categories:

  • True Positive: Truly different and correctly labeled as different.
  • False Negative: Truly different but incorrectly labeled as not different.
  • False Positive (False Discovery): Not truly different but incorrectly labeled as different.
  • True Negative: Not truly different and correctly labeled as not different.

At a significance level of 5%, 5% of results will be false positives (false discoveries) by chance, as p-values are uniformly distributed under the null hypothesis.

To control false discoveries, the Bonferroni correction can be applied:

Divide the significance threshold by the number of tests (n). Alternatively, multiply the p-values by the number of tests. However, Bonferroni is very conservative, especially with a large number of features.

# BONFERRONI correction – which divides the significance level by 
# the number of tests performed (n);  Family-Wise Error Rate (fwer)

p_raw <- toptab_age$P.Value
p_fwer <- p.adjust(p_raw, method = "bonferroni")
plot(p_raw, p_fwer, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")

Challenge 4

  1. At a significance level of 0.05, with 100 tests, what is the Bonferroni significance threshold?

  2. In a gene expression experiment, after FDR correction with an adjusted p-value threshold of 0.05, 500 significant genes are observed. What proportion of these genes are truly different?

Challenge 4 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1. At a significance level of 0.05, with 100 tests, what is the Bonferroni significance threshold?

# Threshold = 0.05 / 100  = 0.0005

2. In a gene expression experiment, after FDR correction with an adjusted p-value threshold of 0.05, 500 significant genes are observed. What proportion of these genes are truly different?

# ----------------------------------------------------------------------
# CHALLENGE 4
# ----------------------------------------------------------------------

# 2. We can’t say what proportion of these genes are truly different. However, if we repeated this
# experiment and statistical test over and over, on average 5% of the results from each run
# would be false discoveries.

# Try applying FDR correction to the p_raw vector.
# Hint: Use the p.adjust() function and check help("p.adjust") for details on the method.
# A = The following code runs FDR correction and compares it to non-corrected values and to Bonferroni:

p_fdr <- p.adjust(p_raw, method = "BH")
plot(p_raw, p_fdr, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")

# plot of chunk plot-fdr-fwer
plot(p_fwer, p_fdr, pch = 16, log="xy")
abline(0:1, lty = "dashed")
abline(v = 0.05, lty = "dashed", col = "red")
abline(h = 0.05, lty = "dashed", col = "red")

Part 3: Regularisation

Regularisation, also called penalised regression, is used for prediction and feature selection, particularly in high-dimensional data. It stabilises coefficient estimates, enabling models to handle more features than observations.

Standard linear models fail in high-dimensional settings because they cannot fit cases with more predictors than data points. While previous methods shared information across models, regularisation provides a faster and efficient alternative.

Next, we’ll explore this by applying regularisation to methylation data.

# ----------------------------------------------------------------------
# REGULARISATION
# ----------------------------------------------------------------------

library("here")
library("minfi")
methylation <- readRDS(here("data/methylation.rds"))

## here, we transpose the matrix to have features as rows and samples as columns
methyl_mat <- t(assay(methylation))
age <- methylation$Age

Then, we try to fit a model with outcome age and all 5,000 features in this dataset as predictors (average methylation levels, M-values, across different sites in the genome).

# by using methyl_mat in the formula below, R will run a multivariate regression
# model in which each of the columns in methyl_mat is used as a predictor. 
fit <- lm(age ~ methyl_mat)
summary(fit)

You can see that we’re able to get some effect size estimates, but they seem very high! The summary also says that we were unable to estimate effect sizes for 4,964 features because of “singularities”. We clarify what singularities are in the note below but this means that R couldn’t find a way to perform the calculations necessary to fit the model. Large effect sizes and singularities are common when naively fitting linear regression models with a large number of features (i.e., to high-dimensional data), often since the model cannot distinguish between the effects of many, correlated features or when we have more features than observations.

Singularities

A singularity occurs when a matrix used in a linear model cannot be inverted, which is necessary for calculating the model’s coefficients. This happens if:

There are more features (predictors) than observations. The features are highly correlated. In such cases, R cannot perform the required calculations and returns an error about singularities.

xtx <- t(methyl_mat) %*% methyl_mat
det(xtx)
## [1] 0

Correlated features

In high-dimensional datasets, there are often multiple features that contain redundant information (correlated features). If we visualise the level of correlation between sites in the methylation dataset, we can see that many of the features represent the same information - there are many off-diagonal cells, which are deep red or blue. For example, the following heatmap visualises the correlations for the first 500 features in the methylation dataset (we selected 500 features only as it can be hard to visualise patterns when there are too many features!).

library("ComplexHeatmap")
small <- methyl_mat[, 1:500]
cor_mat <- cor(small)
Heatmap(cor_mat,
    column_title = "Feature-feature correlation in methylation data",
    name = "Pearson correlation",
    show_row_dend = FALSE, show_column_dend = FALSE,
    show_row_names = FALSE, show_column_names = FALSE
)

Challenge 1

Consider the following questions:

  1. Why would we observe correlated features in high-dimensional biological data?

  2. Why might correlated features be a problem when fitting linear models?

  3. What issue might correlated features present when selecting features to include in a model one at a time?

Challenge 1 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1. Why would we observe correlated features in high-dimensional biological data? Many of the features in biological data represent very similar information biologically. For example, sets of genes that form complexes are often expressed in very similar quantities. Similarly, methylation levels at nearby sites are often very highly correlated.

2.Why might correlated features be a problem when fitting linear models? Correlated features can make inference unstable or even impossible mathematically.

3. What issue might correlated features present when selecting features to include in a model one at a time? When we are selecting features one at a time we want to pick the most predictive feature each time. When a lot of features are very similar but encode slightly different information, which of the correlated features we select to include can have a huge impact on the later stages of model selection!

Challenge 2

  1. What are we minimising when we fit a linear model?
  2. Why are we minimising this objective? What assumptions are we making about our data when we do so?

Challenge 2 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1. What are we minimising when we fit a linear model?

When we fit a linear model we are minimising the squared error!

2. Why are we minimising this objective? What assumptions are we making about our data when we do so?

We minimise squared error because it ignores residual signs, penalises large errors more heavily, and is computationally simple. Least squares assumes residuals are normally distributed with constant variance after accounting for linear relationships.

Model selection using training and test sets

Sets of models are often compared using statistics such as adjusted R^2, AIC or BIC. These show us how well the model is learning the data used in fitting that same model 1. However, these statistics do not really tell us how well the model will generalise to new data. This is an important thing to consider – if our model doesn’t generalise to new data, then there’s a chance that it’s just picking up on a technical or batch effect in our data, or simply some noise that happens to fit the outcome we’re modelling. This is especially important when our goal is prediction – it’s not much good if we can only predict well for samples where the outcome is already known, after all!

To get an idea of how well our model generalises, we can split the data into two - a “training” and a “test” set. We use the “training” data to fit the model, and then see its performance on the “test” data.

One thing that often happens in this context is that large coefficient values minimise the training error, but they don’t minimise the test error on unseen data. First, we’ll go through an example of what exactly this means.

To compare the training and test errors for a model of methylation features and age, we’ll split the data into training and test sets, fit a linear model and calculate the errors. First, let’s calculate the training error. Let’s start by splitting the data into training and test sets:

methylation <- readRDS(here::here("data/methylation.rds"))

library("SummarizedExperiment")
age <- methylation$Age
methyl_mat <- t(assay(methylation))

Subset significant CpGs:

cpg_markers <- c("cg16241714", "cg14424579", "cg22736354", "cg02479575", "cg00864867", 
    "cg25505610", "cg06493994", "cg04528819", "cg26297688", "cg20692569", 
    "cg04084157", "cg22920873", "cg10281002", "cg21378206", "cg26005082", 
    "cg12946225", "cg25771195", "cg26845300", "cg06144905", "cg27377450"
)

horvath_mat <- methyl_mat[, cpg_markers]

## Generate an index to split the data
set.seed(42)
train_ind <- sample(nrow(methyl_mat), 25)

## Split the data 
train_mat <- horvath_mat[train_ind, ]
train_age <- age[train_ind]
test_mat <- horvath_mat[-train_ind, ]
test_age <- age[-train_ind]

Now let’s fit a linear model to our training data and calculate the training error. Here we use the mean of the squared difference between our predictions and the observed data, or “mean squared error” (MSE).

## Fit a linear model
# as.data.frame() converts train_mat into a data.frame
# Using the `.` syntax above together with a `data` argument will lead to
# the same result as using `train_age ~ train_mat`: R will fit a multivariate 
# regression model in which each of the columns in `train_mat` is used as 
# a predictor. We opted to use the `.` syntax because it will help us to 
# obtain model predictions using the `predict()` function. 

fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))

## Function to calculate the (mean squared) error
mse <- function(true, prediction) { 
    mean((true - prediction)^2) 
} 

## Calculate the training error 
err_lm_train <- mse(train_age, fitted(fit_horvath)) 
err_lm_train
## [1] 1.319628

Challenge 3

  1. For the fitted model above, calculate the mean squared error for the test set.

Challenge 3 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1. Calculate the mean squared error for the test set:

First, let’s find the predicted values for the ‘unseen’ test data:

pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat)) 

The mean squared error for the test set is the mean of the squared error between the predicted values and true test data.

err_lm <- mse(test_age, pred_lm)
err_lm
## [1] 223.3571

Unfortunately, the test error is a lot higher than the training error. If we plot true age against predicted age for the samples in the test set, we can gain more insight into the performance of the model on the test set. Ideally, the predicted values should be close to the test data.

par(mfrow = c(1, 1))
plot(test_age, pred_lm, pch = 19)
abline(coef = 0:1, lty = "dashed")

Ridge regression

Ridge regression adds a penalty to the sum of squared coefficients (L2 norm) to control model complexity. This penalty, scaled by a factor λ, balances reducing large coefficients and fitting the data.

\[ \sum_{i=1}^N (y_i - x'_i \beta)^2 + \lambda \|\beta\|_2^2 \]

  • When λ is large: Large coefficients are heavily penalized, simplifying the model.
  • When λ is small: Coefficients shrink less, resulting in a more complex model.
  • When λ = 0: Ridge regression reduces to ordinary least squares.

We’ll use the glmnet package to compare regularized and ordinary least squares models using a subset of 20 features (cpg_markers) identified earlier.

library("glmnet")

## glmnet() performs scaling by default, supply un-scaled data:
horvath_mat <- methyl_mat[, cpg_markers] # select the same 20 sites as before
train_mat <- horvath_mat[train_ind, ] # use the same individuals as selected before
test_mat <- horvath_mat[-train_ind, ]

ridge_fit <- glmnet(x = train_mat, y = train_age, alpha = 0)
plot(ridge_fit, xvar = "lambda")
abline(h = 0, lty = "dashed")

Since we split the data into test and training data, we can prove that ridge regression predicts the test data better than the model with no regularisation. Let’s generate our predictions under the ridge regression model and calculate the mean squared error in the test set:

# Obtain a matrix of predictions from the ridge model,
# where each column corresponds to a different lambda value
pred_ridge <- predict(ridge_fit, newx = test_mat)

# Calculate MSE for every column of the prediction matrix against the vector of true ages
err_ridge <- apply(pred_ridge, 2, function(col) mse(test_age, col)) 
min_err_ridge <- min(err_ridge)

# Identify the lambda value that results in the lowest MSE (ie, the "best" lambda value)
which_min_err <- which.min(err_ridge)
pred_min_ridge <- pred_ridge[, which_min_err]

## Return errors
min_err_ridge
## [1] 46.76802

This is much lower than the test error for the model without regularisation:

err_lm 
## [1] 223.3571

We can see where on the continuum of lambdas we’ve picked a model by plotting the coefficient paths again. In this case, we’ve picked a model with fairly modest coefficient shrinking.

chosen_lambda <- ridge_fit$lambda[which.min(err_ridge)]
plot(ridge_fit, xvar = "lambda")
abline(v = log(chosen_lambda), lty = "dashed")

Challenge 4

  1. Which performs better, ridge or OLS?

  2. Plot predicted ages for each method against the true ages. How do the predictions look for both methods? Why might ridge be performing better?

Challenge 4 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1. Which performs better, ridge or OLS?

Ridge regression performs significantly better on unseen data, despite being “worse” on the training data.

min_err_ridge
## [1] 46.76802
err_lm
## [1] 223.3571

2. Plot predicted ages for each method against the true ages. How do the predictions look for both methods? Why might ridge be performing better?

The ridge ones are much less spread out with far fewer extreme predictions.

all <- c(pred_lm, test_age, pred_min_ridge)
lims <- range(all)
par(mfrow = 1:2)
plot(test_age, pred_lm,
    xlim = lims, ylim = lims,
    pch = 19
)
abline(coef = 0:1, lty = "dashed")
plot(test_age, pred_min_ridge,
    xlim = lims, ylim = lims,
    pch = 19
)
abline(coef = 0:1, lty = "dashed")

#Two plots showing OLS predictions (left) and ridge regression predictions (right) of age (y) against true age (x).

Why Ridge Performs Better:

  • Ridge reduces the variance of the model by stabilising coefficient estimates.

  • It prioritizes generalisability, producing predictions that are more robust to noise.

  • While OLS attempts to minimize the residual sum of squares without constraints, ridge adds a penalty term, effectively balancing model fit and simplicity.

LASSO Regularization

LASSO regularization uses the L1 norm (sum of absolute coefficient values) to shrink coefficients:

\[ \| \beta \|_1 = \sum_{j=1}^p |\beta_j| \]

This approach encourages sparse models, removing unnecessary features by shrinking some coefficients exactly to zero. The sharp boundaries of the restricted region make it more likely that solutions lie at corners, resulting in simpler models.

Cross-validation to find the best value of λ

To balance model complexity and information retention, we choose an optimal λ. A common method is cross-validation, which splits the data into K chunks. K−1 chunks are used for training, and the remaining chunk is for testing. This process rotates through all chunks, providing a reliable estimate of how well each λ value generalizes to new data.

We can use this new idea to choose a lambda value by finding the lambda that minimises the error across each of the test and training splits. In R:

# fit lasso model with cross-validation across a range of lambda values
lasso <- cv.glmnet(methyl_mat, age, alpha = 1)
plot(lasso)

# Extract the coefficients from the model with the lowest mean squared error from cross-validation
coefl <- coef(lasso, lasso$lambda.min)
# select only non-zero coefficients
selection <- which(coefl != 0)
# and convert to a normal matrix
selected_coefs <- as.matrix(coefl)[selection, 1]
selected_coefs
##   (Intercept)    cg02388150    cg06493994    cg22449114    cg22736354 
## -8.4133296328  0.6966503392  0.1615535465  6.4255580409 12.0507794749 
##    cg03330058    cg09809672    cg11299964    cg19761273    cg26162695 
## -0.0002362055 -0.7487594618 -2.0399663416 -5.2538055304 -0.4486970332 
##    cg09502015    cg24771684    cg08446924    cg13215762    cg24549277 
##  1.0787003366  4.5743800395 -0.5960137381  0.1481402638  0.6290915767 
##    cg12304482    cg13131095    cg17962089    cg13842639    cg04080666 
## -1.0167896196  2.8860222552  6.3065284096  0.1590465147  2.4889065761 
##    cg06147194    cg03669936    cg14230040    cg19848924    cg23964682 
## -0.6838637838 -0.0352696698  0.1280760909 -0.0006938337  1.3378854603 
##    cg13578928    cg02745847    cg17410295    cg17459023    cg06223736 
## -0.8601170264  2.2346315955 -2.3008028295  0.0370389967  1.6158734083 
##    cg06717750    cg20138604    cg12851161    cg20972027    cg23878422 
##  2.3401693309  0.0084327521 -3.3033355652  0.2442751682  1.1059030593 
##    cg16612298    cg03762081    cg14428146    cg16908369    cg16271524 
##  0.0050053190 -6.5228858163  0.3167227488  0.2302773154 -1.3787104336 
##    cg22071651    cg04262805    cg24969251    cg11233105    cg03156032 
##  0.3480551279  1.1841804186  8.3024629942  0.6130598151 -1.1121959544

We can see that cross-validation has selected a value of λ resulting in 44 features and the intercept.

Elastic net regression

Elastic net regression combines the properties of ridge (α = 0) and LASSO (α = 1) regression, blending their advantages. It drops uninformative variables like LASSO while maintaining conservative coefficient estimates like ridge, leading to improved predictions.

Elastic net optimizes the following: \[ \sum_{i=1}^N \left( y_i - x'_i\beta \right)^2 + \lambda \left( \alpha \|\beta\|_1 + (1-\alpha) \|\beta\|_2^2 \right) \]

  • When α = 1: Pure LASSO regression (only L1 penalty applies).
  • When α = 0: Pure ridge regression (only L2 penalty applies).
  • For values between 0 and 1: A mix of LASSO and ridge properties.

This flexibility allows elastic net to balance variable selection and prediction accuracy. Contour plots help visualize how the penalty changes for different α values.

Challenge 5

  1. Fit an elastic net model (hint: alpha = 0.5) without cross-validation and plot the model object.

  2. Fit an elastic net model with cross-validation and plot the error. Compare with LASSO.

Challenge 5 Solution:

After attempting to answer these yourself, click on “Show Answers”.

Show Answers

1. Fit an elastic net model (hint: alpha = 0.5) without cross-validation and plot the model object.

  1. Fitting an elastic net model is just like fitting a LASSO model. You can see that coefficients tend to go exactly to zero, but the paths are a bit less extreme than with pure LASSO; similar to ridge.
elastic <- glmnet(methyl_mat[, -1], age, alpha = 0.5)
plot(elastic)

#Line plot showing coefficient values from an elastic net model against L1 norm.

2. Fit an elastic net model with cross-validation and plot the error. Compare with LASSO.

The process of model selection is similar for elastic net models as for LASSO models.

elastic_cv <- cv.glmnet(methyl_mat[, -1], age, alpha = 0.5)
plot(elastic_cv)

The plot here depicts mean squared error (MSE) against discrete values of lambda, with red points showing the average mean squared error and grey error bars showing the 1 standard error interval around them.

The MSE rises as log lambda increases, indicating a larger prediction error. Two dashed lines indicate the lambda value corresponding to the lowest overall MSE value and the lambda value corresponding to the lambda value with MSE with one standard error of the minimum.

Part 4: Extension exercises – Clustering

K-Means Clustering:

Partitional method that divides data into k predefined clusters based on minimizing variance within clusters. Best for large datasets with clearly defined spherical clusters.

Hierarchical Clustering:

Builds a tree (dendrogram) of nested clusters based on similarity. Useful for small datasets or when exploring hierarchical relationships.

Which to use?

Use K-means for speed and scalability; use hierarchical for interpretability and when the number of clusters is unknown.

K-means clustering

Cluster analysis involves finding groups of observations that are more similar to each other (according to some feature) than they are to observations in other groups and are thus likely to represent the same source of heterogeneity. Once groups (or clusters) of observations have been identified using cluster analysis, further analyses or interpretation can be carried out on the groups, for example, using metadata to further explore groups.

Note: Cluster analysis is commonly used to discover unknown groupings in fields such as bioinformatics, genomics, and image processing, in which large datasets that include many features are often produced.

There are various ways to look for clusters of observations in a dataset using different clustering algorithms. One way of clustering data is to minimise distance between observations within a cluster and maximise distance between proposed clusters. Using this process, we can also iteratively update clusters so that we become more confident about the shape and size of the clusters.

What is K-means clustering?

K-means clustering groups data points into a user-defined number of distinct, non-overlapping clusters. To create clusters of ‘similar’ data points, K-means clustering creates clusters that minimise the within-cluster variation and thus the amount that data points within a cluster differ from each other. The distance between data points within a cluster is used as a measure of within-cluster variation.

To carry out K-means clustering, we first pick k initial points as centres or “centroids” of our clusters. There are a few ways to choose these initial “centroids” and this is discussed below. Once we have picked intitial points, we then follow these two steps until appropriate clusters have been formed:

1. Assign each data point to the cluster with the closest centroid.

2. Update centroid positions as the average of the points in that cluster.

While K-means has some advantages over other clustering methods (easy to implement and to understand), it does have some disadvantages, particularly difficulties in identifying initial clusters which observations belong to and the need for the user to specify the number of clusters that the data should be partitioned into.

K-means clustering applied to the single-cell RNA sequencing data

Let’s carry out K-means clustering in R using some real high-dimensional data. We’re going to work with single-cell RNA sequencing data, scRNAseq_data, in these clustering challenges, which is often very high-dimensional. Commonly, experiments profile the expression level of 10,000+ genes in thousands of cells. Even after filtering the data to remove low quality observations, the dataset we’re using in this episode contains measurements for over 9,000 genes in over 3,000 cells.

One way to get a handle on a dataset of this size is to use something we covered earlier in the course - dimensionality reduction. Dimensionality reduction allows us to visualise this incredibly complex data in a small number of dimensions. In this case, we’ll be using principal component analysis (PCA) to compress the data by identifying the major axes of variation in the data, before running our clustering algorithms on this lower-dimensional data to explore the similarity of features.

The scater package has some easy-to-use tools to calculate a PCA for SummarizedExperiment objects. Let’s load the scRNAseq_data and calculate some principal components.

library("SingleCellExperiment")
library("scater")

scrnaseq <- readRDS(here::here("data/scRNAseq_data.rds"))
scrnaseq <- runPCA(scrnaseq, ncomponents = 15)
pcs <- reducedDim(scrnaseq)[, 1:2]

The first two principal components capture almost 50% of the variation within the data. For now, we’ll work with just these two principal components, since we can visualise those easily, and they’re a quantitative representation of the underlying data, representing the two largest axes of variation.

We can now run K-means clustering on the first and second principal components of the scRNAseq_data using the kmeans() function.

set.seed(42)
cluster <- kmeans(pcs, centers = 4)
scrnaseq$kmeans <- as.character(cluster$cluster)
plotReducedDim(scrnaseq, "PCA", colour_by = "kmeans")

We can see that this produces a sensible-looking partition of the data. However, is it totally clear whether there might be more or fewer clusters here?

Extension Challenge 1

  1. Perform clustering to group the data into k=5 clusters, and plot it using plotReducedDim(). Save this with a variable name that’s different to what we just used, because we’ll use this again later.

Extension Challenge 1 Solution:

Above is a problem for you to solve; when you’re ready and have had an attempt yourself, click on “Show Answer” to reveal the code.

Show Answer
set.seed(42)
cluster5 <- kmeans(pcs, centers = 5)
scrnaseq$kmeans5 <- as.character(cluster5$cluster)
plotReducedDim(scrnaseq, "PCA", colour_by = "kmeans5")

Cluster separation

When performing clustering, it is important for us to be able to measure how well our clusters are separated. One measure to test this is silhouette width, which measures how similar a data point is to points in the same cluster compared to other clusters. The silhouette width is computed for every observation and can range from -1 to 1. A high silhouette width means an observation is closer to other observations within the same cluster. For each cluster, the silhouette widths can then be averaged or an overall average can be taken.

More detail on silhouette widths: In more detail, each observation’s silhouette width is computed as follows:

  1. Compute the average distance between the focal observation and all other observations in the same cluster.

  2. For each of the other clusters, compute the average distance between focal observation and all observations in the other cluster. Keep the smallest of these average distances.

  3. Subtract (1.)-(2.) then divivde by whichever is smaller (1.) or (2).

Ideally, we would have only large positive silhouette widths, indicating that each data point is much more similar to points within its cluster than it is to the points in any other cluster. However, this is rarely the case. Often, clusters are very fuzzy, and even if we are relatively sure about the existence of discrete groupings in the data, observations on the boundaries can be difficult to confidently place in either cluster.

Here we use the silhouette() function from the cluster package to calculate the silhouette width of our K-means clustering using a distance matrix of distances between points in the clusters.

Silhoutte width of K-means clustering

# Visualizing Silhouette Widths for K-means Clustering
# Import the required package
library("cluster")

# Compute the distance matrix for principal components
dist_mat <- dist(pcs)

# Calculate silhouette widths for K-means clustering
sil <- silhouette(cluster$cluster, dist = dist_mat)

# Plot the silhouette widths
plot(sil,
     main = "Silhouette Widths for K-means Clustering",  # Improved and clear title
     border = NA,  # Remove borders for better aesthetics
     col = rainbow(length(unique(cluster$cluster)))
     )  # Color each cluster uniquely

Let’s plot the silhouette score on the original dimensions used to cluster the data. Here, we’re mapping cluster membership to point shape, and silhouette width to colour.

pc <- as.data.frame(pcs)
colnames(pc) <- c("x", "y")
pc$sil <- sil[, "sil_width"]
pc$clust <- factor(cluster$cluster)
mean(sil[, "sil_width"])
## [1] 0.7132479
library(ggplot2)

ggplot(pc) +
    aes(x = x, y = y, shape = as.factor(clust), colour = sil) +
    geom_point(size = 2, alpha = 0.9) +  # Adjust size and transparency for better visibility
    scale_colour_gradient2(
        low = "skyblue", mid = "#8494FF", high = "tomato", midpoint = 0.4, 
        name = "Silhouette\nScore"  # Add a descriptive legend title
    ) +
    scale_shape_manual(
        values = setNames(16:19, 1:4),  # Use filled shapes for better visibility
        name = "Cluster"  # Add a legend title
    ) +
    labs(
        title = "Cluster Visualization with Silhouette Scores",
        subtitle = "Scatter plot coloured according to silhouette width, grouped according to cluster membership",
        x = "Principal Component 1",
        y = "Principal Component 2"
    ) +
    theme_minimal() +  
    theme(
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        axis.text = element_text(size = 10)
    )

This plot shows that silhouette values for individual observations tends to be very high in the centre of clusters, but becomes quite low towards the edges. This makes sense, as points that are “between” two clusters may be more similar to points in another cluster than they are to the points in the cluster one they belong to.

Extension Challenge 2

  1. Calculate the silhouette width for the k of 5 clustering we did earlier. Are 5 clusters appropriate? Why/why not?

  2. Can you identify where the differences lie?

Show Answer

1. Calculating the silhoutte width; if the average silhouette width is close to or above 0.5, the clustering is generally appropriate. If below 0.5, it suggests poor cluster structure.

Here, sil_width = 0.58; therefore, clustering generally appropriate.

sil5 <- silhouette(cluster5$cluster, dist = dist_mat)
scrnaseq$kmeans5 <- as.character(cluster5$cluster)
plotReducedDim(scrnaseq, "PCA", colour_by = "kmeans5")

#Scatter plot of principal component 2 against principal component 1 
#in the scRNAseq data, coloured by clusters produced by k-means clustering.

mean(sil5[, "sil_width"])
## [1] 0.5914255

2. The average silhouette width is lower when k=5.

plot(sil5, border = NA)

This seems to be because some observations in clusters 3 and 5 seem to be more similar to other clusters than the one they have been assigned to. This may indicate that k is too high.

Gap statistic:

Another measure of how good our clustering is is the “gap statistic”. This compares the observed squared distance between observations in a cluster and the centre of the cluster to an “expected” squared distances. The expected distances are calculated by randomly distributing cells within the range of the original data. Larger values represent lower squared distances within clusters, and thus better clustering. We can see how this is calculated in the following example.

library("cluster")
gaps <- clusGap(pcs, kmeans, K.max = 20, iter.max = 20)
best_k <- maxSE(gaps$Tab[, "gap"], gaps$Tab[, "SE.sim"])
best_k
## [1] 5
plot(gaps$Tab[,"gap"], xlab = "Number of clusters", ylab = "Gap statistic")
abline(v = best_k, col = "red")

Cluster robustness: bootstrapping

When we cluster data, we want to be sure that the clusters we identify are not a result of the exact properties of the input data. That is, we want to ensure that the clusters identified do not change substantially if the observed data change slightly. This makes it more likely that the clusters can be reproduced.

To assess this, we can bootstrap. We first resample the data with replacement to reproduce a ‘new’ data set. We can then calculate new clusters for this data set and compare these to those on the original data set, thus helping us to see how the clusters may change for small changes in the data. This is maybe easier to see with an example. First, we define some data:

data <- 1:5

Then, we can take a sample from this data without replacement:

sample(data, 5)
## [1] 1 2 3 5 4

This sample is a subset of the original data, and points are only present once. This is the case every time even if we do it many times:

## Each column is a sample
replicate(10, sample(data, 5))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    5    4    2    2    2    5    4    1    2     5
## [2,]    3    5    1    1    3    1    5    4    3     1
## [3,]    4    3    4    3    5    2    3    3    4     3
## [4,]    2    2    3    4    4    4    2    2    5     4
## [5,]    1    1    5    5    1    3    1    5    1     2

However, if we sample with replacement, then sometimes individual data points are present more than once.

replicate(10, sample(data, 5, replace = TRUE))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    5    3    1    5    1    1    4    4    5     4
## [2,]    1    3    5    1    1    3    5    1    4     2
## [3,]    5    3    2    5    3    1    2    4    3     2
## [4,]    2    3    5    2    2    4    2    3    1     4
## [5,]    3    2    2    1    3    5    5    1    1     4

Bootstrapping

Bootstrapping is a powerful and common statistical technique.

We would like to know about the sampling distribution of a statistic, but we don’t have any knowledge of its behaviour under the null hypothesis.

For example, we might want to understand the uncertainty around an estimate of the mean of our data. To do this, we could resample the data with replacement and calculate the mean of each average.

boots <- replicate(1000, mean(sample(data, 5, replace = TRUE)))
hist(boots,
    breaks = "FD",
    main = "1,000 bootstrap samples",
    xlab = "Mean of sample"
 )

In this case, the example is simple, but it’s possible to devise more complex statistical tests using this kind of approach.

The bootstrap, along with permutation testing, can be a very flexible and general solution to many statistical problems.

In applying the bootstrap to clustering, we want to see two things:

1. Will observations within a cluster consistently cluster together in different bootstrap replicates?

2. Will observations frequently swap between clusters?

In the plot below, the diagonal of the plot shows how often the clusters are reproduced in boostrap replicates. High scores on the diagonal mean that the clusters are consistently reproduced in each boostrap replicate. Similarly, the off-diagonal elements represent how often observations swap between clusters in bootstrap replicates. High scores indicate that observations rarely swap between clusters.

library("pheatmap")
library("bluster")
library("viridis")

km_fun <- function(x) {
    kmeans(x, centers = 4)$cluster
}
ratios <- bootstrapStability(pcs, FUN = km_fun, clusters = cluster$cluster)
pheatmap(ratios,
    cluster_rows = FALSE, cluster_cols = FALSE,
    col = viridis(10),
    breaks = seq(0, 1, length.out = 10)
)

#Grid of empirical cluster swapping behaviour estimated by the bootstrap samples.

Yellow boxes indicate values slightly greater than 1, which may be observed. These are “good” (despite missing in the colour bar).

Extension Challenge 3:

  1. Repeat the bootstrapping process with k=5. Do the results appear better or worse? Can you identify where the differences occur on the plotReducedDim()?

Extension Challenge 3 Solution:

Above is a problem for you to solve; when you’re ready and have had an attempt yourself, click on “Show Answer” to reveal the code.

Show Answer
# Plot grid of empirical cluster swapping behaviour estimated by the bootstrap samples.
km_fun5 <- function(x) {
    kmeans(x, centers = 5)$cluster
}
set.seed(42)
ratios5 <- bootstrapStability(pcs, FUN = km_fun5, clusters = cluster5$cluster)
pheatmap(ratios5,
    cluster_rows = FALSE, cluster_cols = FALSE,
    col = viridis(10),
    breaks = seq(0, 1, length.out = 10)
)

When k=5, we can see that the values on the diagonal of the matrix are smaller, indicating that the clusters aren’t exactly reproducible in the bootstrap samples.

Similarly, the off-diagonal elements are considerably lower for some elements. This indicates that observations are “swapping” between these clusters in bootstrap replicates